LCN Wafer Inspection

PART 1 — BIG PICTURE

In industrial desktop systems, failure handling is not a side topic. It is part of the core design.

In a normal business app, an error often means one request failed, one screen failed, or one save operation failed. In a wafer inspection system, an error can mean the machine is still moving while the UI thinks it stopped, the camera stopped delivering images but the workflow keeps running, or inspection results are only half saved while the operator believes the lot is complete.

That is why failure handling is a first-class concern.

Why “just catch exception” is not enough

A lot of engineers learn error handling as:

wrap code in try/catch
log exception
show message
continue

That is nowhere near enough for a machine-control system.

Because the real question is not just:

“Did something throw?”

The real questions are:

What kind of failure is this?
Is the operation safe to retry?
Has the machine state changed or not?
Is the workflow still trustworthy?
Can the operator continue?
Do we need to stop the machine?
Do we need to mark results as invalid?
Can we recover automatically, or do we need human intervention?

That is the real problem.

Why resilience is different in desktop + hardware systems

Web and backend systems usually deal with stateless requests. If one request fails, you retry, return 500, or let another instance handle the next request.

A desktop app controlling hardware is different:

it is long-running
it often owns live device connections
it holds in-memory workflow state
it interacts with machines that have physical behavior
it must keep UI state, machine state, and workflow state aligned

That makes failure handling harder.

Here are some real examples.

Machine disconnection

Suppose the app is sending stage movement commands to the machine controller. The Ethernet connection drops. The app may not know whether:

the command never reached the machine
the command was accepted but the response was lost
the machine is still moving
the machine entered alarm state

This is much worse than a failed HTTP request, because the physical outcome of the command is unknown.

Camera timeout

The camera SDK waits for a frame and times out after 2 seconds. Is this a harmless delay? Is the trigger cable disconnected? Is the camera frozen? Did exposure settings make frame acquisition too slow? The app has to treat this as an operational problem, not just a thrown exception.

Vendor SDK failure

Vendor SDKs are often not clean, modern, predictable libraries. They may:

throw vague exceptions
return integer error codes
deadlock
hang
leak handles
fail only after hours of operation
become invalid after reconnect

So resilience is not just about .NET exceptions. It is also about defending your system from bad external components.

File save failure during inspection

Imagine inspection images are being saved during a run, and disk space runs out. The machine may still be generating results. Now you have a dangerous partial-success situation:

inspection physically happened
some results are in memory
some images are on disk
some metadata is in database
some output is missing

This is exactly the kind of production problem senior engineers think about all the time.

PART 2 — HOW IT ACTUALLY WORKS

Good resilience starts with classifying failures correctly.

1. Expected errors vs unexpected failures

This distinction matters a lot.

Expected errors

These are failures the system should anticipate as part of normal operation.

Examples:

machine not connected
command timeout
camera frame timeout
invalid recipe
file path unavailable
operator tries to start while machine is not homed
disk full
network share temporarily unavailable

These are not “bugs” in the classic sense. They are operational conditions. The system should handle them deliberately.

Unexpected failures

These are failures that indicate bugs, corrupt state, bad assumptions, or broken dependencies.

Examples:

null reference because internal state was inconsistent
unexpected vendor SDK exception type
race condition causing duplicate workflow completion
collection modified from wrong thread
impossible state transition
corrupted inspection result object

These should usually be treated more aggressively. Often you log, stop the workflow, move to safe state, and preserve evidence for debugging.
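One way to make this distinction concrete is to tag each failure with a kind at the boundary where it is caught. The following is a minimal sketch, not a definitive mapping: `FailureKind` and `FailureClassifier` are hypothetical names, and which exception types count as "expected" is an assumption that a real system would decide per boundary.

```csharp
using System;
using System.IO;

// Hypothetical classification, for illustration only.
public enum FailureKind
{
    ExpectedOperational, // anticipated condition: handle deliberately
    UnexpectedFault      // likely a bug or corrupt state: stop, go to safe state, keep evidence
}

public static class FailureClassifier
{
    // Assumed mapping: timeouts, IO problems, and cancellation are operational;
    // everything else (null references, invalid state, unknown SDK errors) is a fault.
    public static FailureKind Classify(Exception ex) => ex switch
    {
        TimeoutException => FailureKind.ExpectedOperational,
        IOException => FailureKind.ExpectedOperational,
        OperationCanceledException => FailureKind.ExpectedOperational,
        _ => FailureKind.UnexpectedFault
    };
}
```

Defaulting unknown exception types to `UnexpectedFault` is the conservative choice: a new failure mode is treated aggressively until someone decides it is an operational condition.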

2. Exception handling strategy

A production system should have clear exception boundaries.

Do not scatter random try/catch blocks everywhere.

Instead, think in layers:

hardware adapter boundary
workflow boundary
background processing boundary
UI command boundary
application top-level boundary

Each boundary has a different responsibility.

Hardware adapter boundary

Convert messy SDK behavior into clean application-specific exceptions or result objects.

Example:

vendor timeout code becomes CameraTimeoutException
disconnected controller becomes MachineDisconnectedException

This isolates the rest of the system from vendor chaos.
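For SDKs that return integer error codes, the same boundary can produce result objects instead of exceptions. This is a sketch under stated assumptions: the code values in `VendorCodes` are invented for illustration (a real SDK defines its own), and `MachineResult`, `MachineErrorKind`, and `VendorResultTranslator` are hypothetical names.

```csharp
using System;

// Hypothetical vendor codes; a real SDK documents its own values.
public static class VendorCodes
{
    public const int Ok = 0;
    public const int Timeout = -2;
    public const int NotConnected = -5;
}

public enum MachineErrorKind { None, Timeout, Disconnected, Unknown }

// Result object: callers inspect Kind instead of decoding raw integers.
public readonly record struct MachineResult(MachineErrorKind Kind, int VendorCode)
{
    public bool IsSuccess => Kind == MachineErrorKind.None;
}

public static class VendorResultTranslator
{
    public static MachineResult Translate(int vendorCode) => vendorCode switch
    {
        VendorCodes.Ok => new MachineResult(MachineErrorKind.None, vendorCode),
        VendorCodes.Timeout => new MachineResult(MachineErrorKind.Timeout, vendorCode),
        VendorCodes.NotConnected => new MachineResult(MachineErrorKind.Disconnected, vendorCode),
        // Unrecognized codes are not guessed at; the raw value is kept for the logs.
        _ => new MachineResult(MachineErrorKind.Unknown, vendorCode)
    };
}
```

Keeping the raw vendor code on the result means the translation never loses the one piece of information the vendor's support team will ask for.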

Workflow boundary

Protect the inspection workflow from crashing unpredictably. If a fatal hardware error happens during inspection, the workflow should transition to Faulted or Stopping, not just die on a background thread.

UI command boundary

When an operator presses Start, Stop, or Load Recipe, you need user-safe feedback. The UI should not show raw stack traces. It should show a meaningful operational message.

Top-level boundary

Unhandled exceptions should be logged with full context and should force the app into a safe failure mode. In some cases, you may disable machine commands, show a fatal error screen, or require restart.
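A minimal sketch of such a top-level boundary, using the process-wide hooks that .NET provides. `GlobalFailureHandlers`, the `MachineCommandsDisabled` flag, and the `logFatal` callback are hypothetical; disabling commands on any unobserved task exception is one possible policy, not the only reasonable one. In a WPF app, `Application.DispatcherUnhandledException` would be wired up the same way for the UI thread (omitted here so the sketch has no WPF dependency).

```csharp
using System;
using System.Threading.Tasks;

public static class GlobalFailureHandlers
{
    // Set once a fatal error is seen; command handlers would check this
    // before sending anything to the machine.
    public static volatile bool MachineCommandsDisabled;

    public static void Install(Action<string, Exception?> logFatal)
    {
        // Exceptions that escaped every other boundary on any thread.
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        {
            MachineCommandsDisabled = true;
            logFatal("Unhandled exception", e.ExceptionObject as Exception);
        };

        // Faulted tasks whose exceptions were never awaited or observed.
        TaskScheduler.UnobservedTaskException += (_, e) =>
        {
            MachineCommandsDisabled = true;
            logFatal("Unobserved task exception", e.Exception);
            e.SetObserved();
        };
    }
}
```

`Install` would typically be called once at application startup, before any hardware connection is opened.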

3. Retry, timeout, fallback, fail-fast

These words are often used loosely. In industrial systems, they need careful meaning.

Retry

Retry is appropriate only when the operation is transient and retry is safe.

Good candidates:

reconnect to machine status channel
save image to network share after temporary IO error
read non-critical telemetry again
query machine status again

Dangerous candidates:

send “start motion” again
send “dispense chemical” again
trigger camera again when unsure whether the first trigger succeeded
commit workflow completion twice

The key question is not “can it fail transiently?” The key question is “is it safe if I accidentally perform it twice?”

Timeout

Timeout is critical because vendor SDKs and hardware APIs often hang.

Without timeouts:

background threads get stuck forever
UI waits indefinitely
shutdown hangs
workflows never complete or fail cleanly

But timeouts must be tuned based on real machine behavior. A timeout that is too short causes false failures. One that is too long leaves the operator waiting and delays recovery.

Fallback

Fallback means switching to a degraded but controlled mode.

Examples:

use local disk if network share fails
continue UI updates without thumbnails if image decode fails
use cached machine metadata if live read fails temporarily
allow offline review of existing results when machine is disconnected

Fallback is useful, but only when it preserves correctness. Never fake success.
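The "use local disk if network share fails" case can be sketched as follows. The key point is that the fallback is reported to the caller instead of being hidden. `ImageSaver` and `SaveOutcome` are hypothetical names, and the synchronous file API is used only to keep the sketch short.

```csharp
using System;
using System.IO;

public enum SaveOutcome { SavedToPrimary, SavedToFallback }

public static class ImageSaver
{
    // Tries the primary directory first. On an IO failure it writes to the
    // fallback directory and says so, so the run can be marked degraded.
    // If the fallback write also fails, the exception propagates: no fake success.
    public static SaveOutcome Save(byte[] data, string fileName, string primaryDir, string fallbackDir)
    {
        try
        {
            File.WriteAllBytes(Path.Combine(primaryDir, fileName), data);
            return SaveOutcome.SavedToPrimary;
        }
        catch (IOException)
        {
            Directory.CreateDirectory(fallbackDir);
            File.WriteAllBytes(Path.Combine(fallbackDir, fileName), data);
            return SaveOutcome.SavedToFallback;
        }
    }
}
```

A caller that receives `SavedToFallback` would record the degraded state on the run and schedule the file for later transfer, rather than treating the save as a normal success.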

Fail-fast

Fail-fast means stopping immediately when continuing would be unsafe or would destroy debuggability.

Examples:

machine/app state divergence detected
impossible workflow state
safety interlock alarm
corrupt recipe
result stream integrity broken in a way you cannot trust

In those cases, stopping early is not weakness. It is good engineering.
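The "impossible workflow state" case is the easiest one to enforce in code: keep an explicit table of allowed transitions and throw on anything else. This is a sketch with a hypothetical `WorkflowStateMachine`; the set of states and transitions is an assumption, not the actual workflow of any particular machine.

```csharp
using System;
using System.Collections.Generic;

public enum WorkflowState { Idle, Running, Paused, Faulted, Completed }

public sealed class WorkflowStateMachine
{
    // Assumed transition table; anything not listed is treated as impossible.
    private static readonly HashSet<(WorkflowState From, WorkflowState To)> Allowed = new()
    {
        (WorkflowState.Idle, WorkflowState.Running),
        (WorkflowState.Running, WorkflowState.Paused),
        (WorkflowState.Paused, WorkflowState.Running),
        (WorkflowState.Running, WorkflowState.Completed),
        (WorkflowState.Running, WorkflowState.Faulted),
        (WorkflowState.Paused, WorkflowState.Faulted),
    };

    public WorkflowState Current { get; private set; } = WorkflowState.Idle;

    public void TransitionTo(WorkflowState next)
    {
        // Fail fast: the state is left unchanged and the caller gets an exception.
        if (!Allowed.Contains((Current, next)))
        {
            throw new InvalidOperationException(
                $"Impossible workflow transition {Current} -> {next}.");
        }

        Current = next;
    }
}
```

This sketch is not thread-safe; a real workflow would guard `TransitionTo` with a lock or confine it to a single thread.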

PART 3 — REAL PROBLEMS IN THIS SYSTEM

Now let’s apply this to:

A WPF desktop app controlling a wafer inspection machine

Machine command timeout

Suppose the app sends MoveToPositionAsync(x, y) and waits for completion.

A timeout may mean several different things:

command never arrived
machine accepted command but response was lost
machine is still moving
controller hung
motion completed but status polling failed

This is why timeout handling cannot simply do:

throw exception
show popup
continue

Instead, after timeout, the system usually needs a recovery sequence:

mark command as uncertain
stop issuing new commands
attempt status re-sync
query motion/alarm state
possibly issue safe stop
transition workflow into paused/faulted state
require operator acknowledgement if trust is lost

The important thing is that timeout creates uncertainty, not just delay.
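The recovery sequence above can be sketched as a small reconciliation step that runs after a timeout. All names here (`TimeoutReconciler`, `MotionState`, `TimeoutResolution`) are hypothetical, and the controller is represented by two delegates so the sketch stays self-contained. Note that an idle machine is not treated as proof of success, because the command may never have arrived.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum MotionState { Idle, Moving, Alarm, Unknown }

public enum TimeoutResolution
{
    IdleVerifyPosition, // machine is idle, but the position must still be checked
    StoppedSafely,      // machine was still moving and a safe stop was issued
    OperatorRequired    // alarm or unknown state: trust is lost
}

public static class TimeoutReconciler
{
    public static async Task<TimeoutResolution> ReconcileAsync(
        Func<CancellationToken, Task<MotionState>> queryState,
        Func<CancellationToken, Task> safeStop,
        CancellationToken ct)
    {
        MotionState state;
        try
        {
            state = await queryState(ct);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // If even the status query fails, the machine state is unknown.
            state = MotionState.Unknown;
        }

        switch (state)
        {
            case MotionState.Idle:
                return TimeoutResolution.IdleVerifyPosition;
            case MotionState.Moving:
                await safeStop(ct);
                return TimeoutResolution.StoppedSafely;
            default:
                return TimeoutResolution.OperatorRequired;
        }
    }
}
```

The caller is expected to have already stopped issuing new commands before calling this, and to move the workflow into a paused or faulted state based on the returned value.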

Hardware alarm during active inspection

Suppose an alarm occurs while inspecting wafer die 423 of 800.

Now you have multiple parallel concerns:

machine physical state
workflow state
acquired images
analysis pipeline
partial persisted results
UI state

Good handling usually looks like this:

immediately stop accepting further work into pipeline
cancel or drain background stages
mark current item as incomplete or invalid
snapshot machine alarm information
preserve partial results with explicit status
transition UI into alarm mode
force operator decision: retry item, skip item, abort lot, recover machine

This is not just exception handling. It is controlled workflow degradation.

Partial workflow completion

This is one of the hardest real-world problems.

Example:

700 dies inspected successfully
50 have images saved but analysis not finished
10 were in-flight in memory
database commit for lot summary failed

What is the truth of the run?

A weak design loses trust here. A strong design explicitly models partial completion.

You need concepts like:

Completed
PartiallyCompleted
Faulted
ResultsPendingPersistence
RecoveryRequired

Without explicit status modeling, teams invent fragile boolean flags and end up with silent data corruption.
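A sketch of what explicit status modeling might look like. The status is derived from counts instead of being assembled from boolean flags. `RunCounts` and `RunStatusRules` are hypothetical, the rules are one possible policy, and `Faulted` and `RecoveryRequired` are left out because they would be set by the fault-handling paths rather than derived from counts.

```csharp
using System;

public enum RunStatus { Completed, PartiallyCompleted, ResultsPendingPersistence }

// Finalized:   dies that were inspected, analyzed, and persisted.
// Recoverable: dies whose raw data is on disk, so remaining work can be finished later.
// Dies in neither group were never completed or exist only in memory.
public readonly record struct RunCounts(int TotalDies, int Finalized, int Recoverable, bool SummaryCommitted);

public static class RunStatusRules
{
    public static RunStatus Derive(RunCounts c)
    {
        // Some dies have no durable data at all: the run is incomplete.
        if (c.Finalized + c.Recoverable < c.TotalDies)
        {
            return RunStatus.PartiallyCompleted;
        }

        // Everything was captured, but some of it is not fully persisted yet.
        if (c.Recoverable > 0 || !c.SummaryCommitted)
        {
            return RunStatus.ResultsPendingPersistence;
        }

        return RunStatus.Completed;
    }
}
```

With the figures from the example above, and assuming the run had 760 dies in total, `Derive(new RunCounts(760, 700, 50, false))` returns `PartiallyCompleted`, because the 10 in-flight dies have no durable data.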

Losing synchronization between app state and machine state

This is a classic industrial failure mode.

The app thinks:

machine is idle
no active run
stage at home

The machine is actually:

still in run mode
paused on alarm
stage midway through motion
recipe loaded from previous job

This can happen after:

reconnect
app crash and restart
network drop
SDK reinitialization
machine reboot

When this happens, the system must re-synchronize deliberately. It cannot just resume normal operation.

A good recovery flow often includes:

query machine authoritative state
compare with local workflow state
detect mismatch categories
decide whether auto-recovery is allowed
otherwise require operator-assisted reconciliation

Example: “Machine reports inspection active, but application has no active job context.” That is not a popup problem. That is a controlled recovery problem.
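The comparison step of that recovery flow can be sketched as a pure function over two snapshots. `MachineSnapshot`, `LocalSnapshot`, `Mismatch`, and `StateReconciler` are hypothetical names, and the mismatch categories are illustrative rather than exhaustive.

```csharp
using System;
using System.Collections.Generic;

// What the machine reports (authoritative) versus what the app believes.
public sealed record MachineSnapshot(bool RunActive, bool AlarmActive, string? LoadedRecipe);
public sealed record LocalSnapshot(bool HasActiveJob, string? ExpectedRecipe);

public enum Mismatch { OrphanedMachineRun, LostMachineRun, AlarmPending, RecipeDiffers }

public static class StateReconciler
{
    public static IReadOnlyList<Mismatch> Compare(MachineSnapshot machine, LocalSnapshot local)
    {
        var found = new List<Mismatch>();

        // Machine is running something the app knows nothing about.
        if (machine.RunActive && !local.HasActiveJob) found.Add(Mismatch.OrphanedMachineRun);

        // App believes a job is active, but the machine is not running.
        if (!machine.RunActive && local.HasActiveJob) found.Add(Mismatch.LostMachineRun);

        if (machine.AlarmActive) found.Add(Mismatch.AlarmPending);

        if (!string.Equals(machine.LoadedRecipe, local.ExpectedRecipe, StringComparison.Ordinal))
        {
            found.Add(Mismatch.RecipeDiffers);
        }

        return found;
    }

    // Conservative assumed policy: automatic resume only when nothing differs.
    public static bool AutoRecoveryAllowed(IReadOnlyList<Mismatch> mismatches) => mismatches.Count == 0;
}
```

Keeping the comparison free of side effects makes it easy to unit test every mismatch combination without a machine attached.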

Failure while streaming results or saving images

Real-time result streaming often involves multiple stages:

acquisition
preprocessing
analysis
visualization
persistence

If saving images fails but visualization continues, the operator might think everything is fine while forensic evidence is missing.

If analysis fails for some items but the stream keeps flowing, totals may look valid while defect data is incomplete.

That means each pipeline stage needs clear failure semantics.

Possible policies:

stop entire run on any persistence failure
continue inspection but mark run degraded
buffer temporarily and retry save
switch to emergency local spool folder
allow review-only mode until data integrity restored

There is no one universal answer. It depends on whether missing data is acceptable.

PART 4 — HOW WE USE IT IN .NET (PRACTICAL)

The practical .NET approach is to create clear error boundaries, domain-specific exceptions, safe timeout handling, and recovery-oriented workflow code.

1. Exception boundaries

A good pattern is:

low-level SDK layer translates raw failures
application service decides recovery action
UI layer shows operator-friendly message

Example: hardware adapter boundary

csharp

using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class MachineDisconnectedException : Exception
{
    public MachineDisconnectedException(string message, Exception? inner = null)
        : base(message, inner) { }
}

public sealed class MachineCommandTimeoutException : Exception
{
    public string CommandName { get; }
    public TimeSpan Timeout { get; }

    public MachineCommandTimeoutException(string commandName, TimeSpan timeout)
        : base($"Machine command '{commandName}' timed out after {timeout}.")
    {
        CommandName = commandName;
        Timeout = timeout;
    }
}

public interface IMachineController
{
    Task MoveToAsync(double x, double y, CancellationToken cancellationToken);
    Task StopAsync(CancellationToken cancellationToken);
    Task<MachineStatus> GetStatusAsync(CancellationToken cancellationToken);
}

csharp

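using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// IVendorSdk, VendorSdkDisconnectedException, and MachineStatus are assumed to be defined elsewhere.
public sealed class VendorMachineController : IMachineController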
{
    private readonly IVendorSdk _sdk;

    public VendorMachineController(IVendorSdk sdk)
    {
        _sdk = sdk;
    }

    public async Task MoveToAsync(double x, double y, CancellationToken cancellationToken)
    {
        try
        {
            var success = await _sdk.MoveStageAsync(x, y, cancellationToken);

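            if (!success)
            {
                // The SDK reported failure without throwing; surface it instead of continuing silently.
                throw new InvalidOperationException("Vendor SDK reported move failure.");
            }
        }
        catch (TimeoutException)
        {
            // The 5-second value is hard-coded here; it should match the timeout the SDK call actually used.
            throw new MachineCommandTimeoutException("MoveTo", TimeSpan.FromSeconds(5));
        }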
        catch (SocketException ex)
        {
            throw new MachineDisconnectedException("Machine connection lost during move command.", ex);
        }
        catch (VendorSdkDisconnectedException ex)
        {
            throw new MachineDisconnectedException("Vendor SDK reports machine disconnected.", ex);
        }
    }

    public Task StopAsync(CancellationToken cancellationToken)
        => _sdk.StopAsync(cancellationToken);

    public Task<MachineStatus> GetStatusAsync(CancellationToken cancellationToken)
        => _sdk.ReadStatusAsync(cancellationToken);
}

The important part is not the syntax. The important part is isolating vendor weirdness from the rest of the application.

2. Timeout patterns

Timeouts should not be hidden all over the place. They should be explicit.

csharp

using System;
using System.Threading;
using System.Threading.Tasks;

public static class TaskTimeoutExtensions
{
    public static async Task<T> WithTimeout<T>(
        this Task<T> task,
        TimeSpan timeout,
        string operationName,
        CancellationToken cancellationToken = default)
    {
        // Reuse the non-generic overload, then read the already-completed result.
        await ((Task)task).WithTimeout(timeout, operationName, cancellationToken);
        return await task;
    }

    public static async Task WithTimeout(
        this Task task,
        TimeSpan timeout,
        string operationName,
        CancellationToken cancellationToken = default)
    {
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        var delayTask = Task.Delay(timeout, timeoutCts.Token);

        var completed = await Task.WhenAny(task, delayTask);

        if (completed == delayTask)
        {
            // The delay also completes when the caller cancels. Report that as
            // cancellation, not as a timeout.
            cancellationToken.ThrowIfCancellationRequested();

            // The original task is abandoned here, not stopped. It may still be running.
            throw new TimeoutException($"Operation '{operationName}' timed out after {timeout}.");
        }

        // Stop the timer, then await the task to propagate its result or exception.
        timeoutCts.Cancel();
        await task;
    }
}

Use it carefully:

csharp

await _machineController.MoveToAsync(x, y, cancellationToken)
    .WithTimeout(TimeSpan.FromSeconds(5), "Move stage", cancellationToken);

But here is the senior-engineer warning:

A timeout does not guarantee the underlying operation stopped.

This is especially true with vendor SDKs. Your app may stop waiting, but the machine may still be moving. So timeout must usually trigger reconciliation logic afterward.
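On .NET 6 and later, `Task.WaitAsync(TimeSpan, CancellationToken)` offers the same "stop waiting" behavior built in, and the same warning applies to it. The sketch below uses a hypothetical helper, `WaitAsyncDemo.CompletedWithinAsync`, to show that the wait ending says nothing about the operation itself.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class WaitAsyncDemo
{
    // Returns true if the operation finished in time, false if the caller
    // gave up waiting. Cancellation via ct still surfaces as an exception.
    public static async Task<bool> CompletedWithinAsync(
        Task operation,
        TimeSpan timeout,
        CancellationToken ct = default)
    {
        try
        {
            await operation.WaitAsync(timeout, ct);
            return true;
        }
        catch (TimeoutException)
        {
            // Only the wait ended. The operation may still be running.
            return false;
        }
    }
}
```

After a `false` result, `operation.IsCompleted` is typically still `false`, which is exactly the uncertain state that the reconciliation logic has to resolve.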

3. Safe retry patterns

Blind retry is dangerous. Wrap only safe operations.

csharp

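using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetryHelper
{
    public static async Task RetryTransientAsync(
        Func<Task> action,
        int maxAttempts,
        TimeSpan delay,
        Func<Exception, bool> shouldRetry,
        Action<int, Exception>? onRetry = null,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts), "At least one attempt is required.");
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await action();
                return;
            }
            // On the final attempt the filter is false, so the exception propagates to the caller.
            catch (Exception ex) when (attempt < maxAttempts && shouldRetry(ex))
            {
                onRetry?.Invoke(attempt, ex);
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}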

Safe usage example for file save:

csharp

await RetryHelper.RetryTransientAsync(
    action: () => _imageStore.SaveAsync(image, path, cancellationToken),
    maxAttempts: 3,
    delay: TimeSpan.FromMilliseconds(500),
    shouldRetry: ex => ex is IOException || ex is UnauthorizedAccessException,
    onRetry: (attempt, ex) =>
    {
        _logger.LogWarning(ex,
            "Retrying image save. Attempt {Attempt}. Path={Path}, InspectionId={InspectionId}",
            attempt, path, inspectionId);
    });

But do not do this for uncertain machine commands like StartInspectionAsync() unless the command is designed to be idempotent or has a command ID that lets the machine reject duplicates.
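The command ID idea can be illustrated with a simulated receiver. This is a sketch of the concept only: `DeduplicatingCommandReceiver` is hypothetical, and on a real machine this logic would have to live in the controller or its firmware, not in the desktop app.

```csharp
using System;
using System.Collections.Generic;

// Simulated controller-side handling, for illustration only.
public sealed class DeduplicatingCommandReceiver
{
    private readonly HashSet<Guid> _seen = new();

    public int ExecutedCount { get; private set; }

    // Returns true if the command was executed, false if this ID was
    // already handled and the duplicate was rejected.
    public bool Handle(Guid commandId, Action execute)
    {
        if (!_seen.Add(commandId))
        {
            return false;
        }

        execute();
        ExecutedCount++;
        return true;
    }
}
```

With this in place, a sender that is unsure whether its first attempt arrived can resend the same command with the same ID, and the physical action happens at most once. The sketch is not thread-safe and never forgets IDs; both would need attention in a real implementation.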

4. Recovery flows after machine or workflow failure

The most important code in real systems is not “happy path” code. It is recovery code.

csharp

public sealed class InspectionWorkflowService
{
    private readonly IMachineController _machineController;
    private readonly IInspectionStateStore _stateStore;
    private readonly ILogger<InspectionWorkflowService> _logger;

    public InspectionWorkflowService(
        IMachineController machineController,
        IInspectionStateStore stateStore,
        ILogger<InspectionWorkflowService> logger)
    {
        _machineController = machineController;
        _stateStore = stateStore;
        _logger = logger;
    }

    public async Task RunInspectionAsync(InspectionJob job, CancellationToken cancellationToken)
    {
        try
        {
            await _stateStore.MarkRunningAsync(job.Id, cancellationToken);

            foreach (var die in job.Dies)
            {
                cancellationToken.ThrowIfCancellationRequested();

                await InspectDieAsync(job.Id, die, cancellationToken);
            }

            await _stateStore.MarkCompletedAsync(job.Id, cancellationToken);
        }
        catch (MachineCommandTimeoutException ex)
        {
            _logger.LogError(ex,
                "Machine command timeout during inspection. JobId={JobId}",
                job.Id);

            // Cleanup must run even if the caller's token is already canceled.
            await TryEnterSafeStopAsync(job.Id, "Machine command timeout", CancellationToken.None);
            await _stateStore.MarkFaultedAsync(job.Id, "Machine timeout", CancellationToken.None);
            throw;
        }
        catch (MachineDisconnectedException ex)
        {
            _logger.LogError(ex,
                "Machine disconnected during inspection. JobId={JobId}",
                job.Id);

            // No safe stop is attempted here because the connection is gone.
            await _stateStore.MarkRecoveryRequiredAsync(job.Id, "Machine disconnected", CancellationToken.None);
            throw;
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            _logger.LogInformation(
                "Inspection canceled by request. JobId={JobId}",
                job.Id);

            await TryEnterSafeStopAsync(job.Id, "Canceled", CancellationToken.None);
            await _stateStore.MarkCanceledAsync(job.Id, CancellationToken.None);
            throw;
        }
        catch (Exception ex)
        {
            _logger.LogCritical(ex,
                "Unexpected fatal inspection failure. JobId={JobId}",
                job.Id);

            await TryEnterSafeStopAsync(job.Id, "Unexpected failure", CancellationToken.None);
            await _stateStore.MarkFaultedAsync(job.Id, "Unexpected fatal error", CancellationToken.None);
            throw;
        }
    }

    private Task InspectDieAsync(string jobId, DiePosition die, CancellationToken cancellationToken)
    {
        // Placeholder: move, acquire, analyze, persist.
        return Task.CompletedTask;
    }

    private async Task TryEnterSafeStopAsync(string jobId, string reason, CancellationToken cancellationToken)
    {
        try
        {
            _logger.LogWarning(
                "Attempting safe stop. JobId={JobId}, Reason={Reason}",
                jobId, reason);

            await _machineController.StopAsync(cancellationToken);
        }
        catch (Exception ex)
        {
            _logger.LogCritical(ex,
                "Safe stop failed. JobId={JobId}, Reason={Reason}",
                jobId, reason);
        }
    }
}

Notice the mindset:

catch specific operational failures first
transition workflow state explicitly
attempt safe stop
never assume stop succeeded
preserve logs and recovery state

That is production thinking.

5. Logging with enough context

A log line that says:

Error during inspection

is almost useless.

In real systems, you need context rich enough to reconstruct the failure.

Good context often includes:

job/lot ID
wafer ID
die position
recipe version
machine state
command name
timeout value
camera ID
image path
thread or pipeline stage
correlation or operation ID
operator action that triggered it

Example:

csharp

_logger.LogError(ex,
    "Image save failed. JobId={JobId}, WaferId={WaferId}, DieX={DieX}, DieY={DieY}, CameraId={CameraId}, Path={Path}, Recipe={Recipe}",
    job.Id,
    wafer.Id,
    die.X,
    die.Y,
    camera.Id,
    imagePath,
    recipe.Version);

That is the difference between “we saw an error” and “we can actually debug it tomorrow.”

6. Operator-safe UI messaging

Never dump raw exception text to machine operators.

Bad:

“NullReferenceException at ImagePipeline.cs line 84”
“SocketException 10054”
giant stack trace popup

Better:

“Camera image acquisition timed out. Inspection has been paused.”
“Machine connection was lost. Reconnect and re-synchronize before continuing.”
“Inspection images could not be saved. Current run was stopped to protect data integrity.”

Keep technical detail in logs, not in operator messages.

A simple mapping pattern works well:

csharp

public sealed class OperatorMessageService
{
    public string ToOperatorMessage(Exception ex) => ex switch
    {
        MachineDisconnectedException =>
            "Machine connection was lost. Please reconnect and verify machine state.",

        MachineCommandTimeoutException =>
            "A machine command timed out. The system is verifying machine state before continuing.",

        IOException =>
            "Failed to save inspection data. Please verify storage availability.",

        _ =>
            "An unexpected system error occurred. The operation was stopped safely."
    };
}

PART 5 — COMMON MISTAKES (VERY REALISTIC)

1. `catch (Exception)` everywhere

This usually starts from good intentions. Teams want the app to “never crash.”

So they wrap everything.

Result:

real bugs get hidden
state corruption continues
workflows limp forward in invalid state
logs become noisy and useless
root causes become impossible to identify

Production consequence: the app looks stable on the surface, but operators start seeing strange behavior, missing results, frozen workflows, and random recovery issues.

2. Swallowing errors

This is one of the worst industrial mistakes.

Example:

csharp

try
{
    await SaveImageAsync(...);
}
catch
{
    // ignore
}

That is not resilience. That is data loss.

Production consequence:

missing images
inconsistent reports
invalid defect evidence
debugging nightmare because failure happened silently

3. Retrying dangerous operations blindly

Teams often build a “generic retry helper” and apply it to everything.

That is a trap.

Retrying a read is different from retrying a physical command.

Production consequence:

duplicate machine actions
inconsistent stage position
repeated device triggers
duplicate workflow commits
safety risk in real hardware scenarios

4. Showing technical exception text directly to operators

Operators need actionable operational guidance, not developer detail.

Production consequence:

confusion
wrong recovery action
unnecessary panic
support tickets with screenshots of meaningless stack traces

5. Not restoring system to safe state

Some systems detect failure correctly but do not perform safe shutdown or safe pause.

Example:

UI shows “inspection failed”
but acquisition thread still running
machine still executing
pipeline still buffering results

Production consequence:

system drift
app/machine desynchronization
more damage after the initial error than from the original error itself

PART 6 — PERFORMANCE & TRADE-OFFS

Retry cost

Retries are not free.

They add:

latency
duplicate load
queue buildup
slower recovery
operator waiting time

In real-time systems, aggressive retries can make the whole system feel hung. Sometimes one fast failure is better than three slow retries.

Timeout tuning trade-offs

Timeouts are always a balancing act.

If timeout is too short:

you get false alarms
you interrupt valid slow operations
operators lose trust in the system

If timeout is too long:

recovery is delayed
UI appears frozen
cancellation feels broken
stuck SDK calls occupy resources too long

Good timeout values usually come from real machine measurements, not guesswork.

You often want different timeout classes:

UI responsiveness timeout
machine command timeout
reconnect timeout
persistence timeout
shutdown timeout

Not everything should use “30 seconds.”
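One way to keep these classes separate and explicit is a single settings type with one property per class. `TimeoutSettings` is a hypothetical name, and every default below is a placeholder: as noted above, real values should come from measurements on the target machine.

```csharp
using System;

// Placeholder defaults only; tune each value from real machine measurements.
public sealed record TimeoutSettings
{
    public TimeSpan UiResponsiveness { get; init; } = TimeSpan.FromMilliseconds(500);
    public TimeSpan MachineCommand { get; init; } = TimeSpan.FromSeconds(5);
    public TimeSpan Reconnect { get; init; } = TimeSpan.FromSeconds(15);
    public TimeSpan Persistence { get; init; } = TimeSpan.FromSeconds(10);
    public TimeSpan Shutdown { get; init; } = TimeSpan.FromSeconds(20);
}
```

A machine-specific profile can then override only what differs, for example `new TimeoutSettings { MachineCommand = TimeSpan.FromSeconds(8) }`, and every call site names the timeout class it depends on instead of carrying its own magic number.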

Fail-fast vs aggressive recovery

Fail-fast is better when:

correctness matters more than uptime
machine state is uncertain
duplicate execution is dangerous
results cannot be trusted

Aggressive recovery is better when:

the operation is read-only or idempotent
the failure is clearly transient
you can preserve correctness while retrying
operator disruption is costly

Senior engineers do not ask, “Should we always retry or always fail fast?” They ask, “What is the cost of being wrong in each direction?”

PART 7 — SENIOR ENGINEER THINKING

1. Experienced engineers classify failures

Strong engineers do not treat all errors equally.

They classify by questions like:

Is it transient or persistent?
Is it expected or unexpected?
Is it safe to retry?
Is physical state uncertain?
Is data integrity at risk?
Is operator intervention required?
Can we recover automatically?
Can we still trust the workflow state?

That classification drives design.

2. Design for recovery, not just detection

Junior systems focus on detecting error.

Senior systems focus on what happens next.

That means designing:

safe stop flows
re-sync flows
reconnect flows
partial result marking
degraded mode behavior
operator acknowledgement steps
restart and resume rules

A system that detects failures but cannot recover cleanly is not really resilient.

3. Balance robustness vs complexity

You can over-engineer resilience.

If every component has:

retries
fallback modes
buffering
local spooling
reconnect loops
recovery orchestration
circuit breakers
custom state reconciliation

then the system becomes very hard to reason about.

The goal is not “maximum cleverness.” The goal is “predictable behavior under failure.”

Simple, explicit, boring recovery flows are usually better than magical automation.

4. Preserve debuggability under failure

This is a very senior principle.

Many systems become least observable exactly when failure happens.

To avoid that, design failure handling to preserve evidence:

structured logs
operation IDs
machine state snapshot on fault
current recipe and wafer context
workflow step at failure
persistence status of partial outputs
vendor error codes
timestamps around timeout and retry

Under failure, you need more signal, not less.

Final takeaway

In industrial desktop systems, error handling is really about trust.

Can the operator trust what the UI says? Can the engineer trust the saved results? Can the support team reconstruct what happened? Can the machine be brought back to a safe state? Can the workflow recover without hidden corruption?

That is why resilience is not “add try/catch.”

It is:

classify failures correctly
isolate bad dependencies
enforce explicit state transitions
stop safely when trust is lost
retry only when safe
preserve evidence for debugging
design recovery as part of the system, not as an afterthought

